INTRODUCTION

This short 2-hour (2 hour) workshop introduces a few basic methods for the analysis of DArTseq marker data produced by diversity Arrays Technology (DArT).

 

1 Workshop Organizers and Speakers

 

2 Seeds for Resilience Sweetpotato Project | Darwin Clean and Share Sweetpotato Project

 

 

 

 


2.1 Diversity Arrays Technology (DArT)

 

 

https://www.diversityarrays.com

 

Diversity Arrays is a company in Canberra, Australia who uses their proprietary DArT platform as a service provider of markers for any organism – microbes, animals, plants. They have a large bioinformatics capacity and are constantly developing new tools to aid in marker development and analysis.

What are DArTseq markers?

DArTseq markers are molecular markers produced by the DArT platform. This platform provides a generic and cost-effective genotyping technology that detects all types of polymorphism (SNP, InDels, CNV, methylation). Invented by Andrzej Kilian to overcome some of the limitations of other molecular marker technologies such as RFLP, AFLP and SSR. It uses a complexity reduction method which targets hypomethylated regions of the genome and removes highly repetitive regions that may cause difficulties in sequencing and marker detection.

NOTE:

DArT and DArTseq markers

DArT markers were created in 2001 using microarray technology. This name “DArT” has remained but DArTseq markers now use a more modern technology of reduced representation sequencing.

 

DArTseq datasets

The Report containing the data from DArT is normally download from the official page of Diversity Arrays Technology Pty. Ltd.(https://www.diversityarrays.com). For this workshop the dataset will be provided by email in a csv format.

DArT provides three different datasets.

– 1. SilicoDArT(Presence/Absence)

– 2. SNPs (Single Nucleotide Polymorphism)

    1. One-Row format
    1. Two-row format

 


SilicoDArT markers

SilicoDArTs are presence/absence (dominant) markers obtained by the successful restriction of genomic DNA.

  • 1 Presence
  • 0 Absence
  • - Missing data

 

GLOSSARY:

What the word silico stands for?

Silico in SilicoDArT comes from “in silicon” and means “carrying out experiments on computer or via computer simulation”, modeled on the expressions in vitro and in vivo.

 


THE SILICODART REPORT

 

Original report

The following table represents how you will actually see your data if you open the csv data in an Excel Spreadsheet

 

Original SilicoDArT.csv file.

Figure 2.1: Original SilicoDArT.csv file.

 

Colored report

The following is a colored example of the report for SilicoDArT and its components, opened in an Excel spreadsheet.

 

The report includes columns with the following information:

  • In blue are columns with marker IDs and sequence,
  • In yellow are columns with mapping position and scores,
  • In green are columns with markers stats (CallRate, Frequencies, Reproducibility, etc).

 

Sample and Marker IDs

 

Contrary to other datasets for molecular markers where sample IDs are in rows and Marker IDs are in columns. Diversity Arrays produces so many markers that it is necessary to flip the order to accommodate more IDs for markers.

 

Empty spaces

 

This cells do not contain any information, but taking note of how many asterisks(*) there are is useful for loading the dataset later:

  • In this case, the dataset has 14 horizontal and 6 vertical asterisks (*).

 

Information from shipped plates

 

When sending DNA to diversity arrays, we do so in 96-well PCR plates. Every samples has a specific position on this Plate for column and row.

 

Presence/Absence/Missing-data

 

As mention before, in this SilicoDArT dataset, presence are represented with “1”, Absence with”0” and Missing data with “-”.

 

Columns with marker IDs and sequence

 

When expanding the blue columns, you can see the sequence information in the Allele Sequence column and Trimmed Sequence with more detail.

CloneID is the Unique identifier given to every marker by Diversity Arrays, and the AlleleSequence is the sequence for the Allele where the marker is positioned. This blue columns also contain the Trimmed sequence for when the AlleleSequence surpasses certain length.

 

Columns with mapping position and scores

 

When expanding the yellow columns, you can see the information for chromosome position, mapping and scores.

When the sequence exists for the genome of the organism of interest, we can know the chromosome position.

 

 

Columns with markers stats

 

The green columns contain the quality information for the SilicoDArT markers. Stats such as CallRate, Frequencies, Reproducibility, etc.

 

Some of the most relevant information for this workshop is contained in the following columns:

  • CloneID – Unique identifier of the sequence tag which corresponds to each sample sent

  • CallRate – The proportion of samples for which the genotype call is either “1” or “0”, rather than “-”

  • Reproducibility – The proportion of technical replicate assay pairs for which the marker score is consistent

 

NOTE:

There is other information we will not be using today:

  • OneRatio –The proportion of samples for which the genotype score is “1”

  • PIC – Polymorphism Information Content

  • Qpmr – The average of the normalized non-zero tag read counts divided by the standard deviation of the normalized non-zero tag read counts (If standard deviation is zero or near zero, the Qpmr value will be 100)

 


 

2.2 R and RStudio IDE

Basic knowledge of R and RStudio are needed to perform any analysis yet our attempt is provide both a basic understanding of what markers can do but also to provide more detailed understanding of the use of the tools involved for the analysis. This workshop will provide an introduction to R and RStudio but to perform any analysis a user will need to familiarize and practice.

 

GLOSSARY:

IDE:

IDE or Integrated Development Environment, enables programmers to combine their activities to write software in a single application: editing source code, building executables, debugging, plotting.

 

 

 

 


 

2.2.1 Introductory courses for R and RStudio

Unfortunately, this workshop can only cover a basic use for R and Rtudio, but please look up this Introductory courses. The first two tutorials are recommended for beginners by the RStudio team but if you want to keep learning, RStudio has resources for Intermediates and Experts too.

  • Beginner-friendly installation instructions (ModernDive)

https://moderndive.netlify.app/1-getting-started.html

  • A tour of RStudio for new users (R-ladies Sydney)

https://rladiessydney.org/courses/ryouwithme/01-basicbasics-1/

  • If you want to keep learning please go to Finding your way to R (RStudio Education)

https://education.rstudio.com/learn/

 

2.2.2 RStudio pane layout

RStudio has 4 basic windows, also called panes:

  • The source pane: This pane is for writing and editing R scripts.

  • The Environment/History pane: This pane shows datasets and objects created in the Environment tab. The History tab contains the history of the R commands executed in the session.

  • The Console/ Terminal pane: This pane shows the execution of commands an output from the source pane in the Console tab. The Terminal tab gives access to the BASH terminal (Linux, not R).

  • The Files/Plots/Packages/ and Help pane: This is a multipurpose pane. The Files tab allows the user to navigate and set the working directory. The Plots tab show any plot generated as output. The Packages tab is for installing and loading R packages. and the Help tab displays the help files for R functions and packages.

 

 

(This is the default layout for RStudio, to learn how to customize it please go to https://support.posit.co/hc/en-us/articles/200549016-Customizing-RStudio)


 

2.2.3 Basic terminology

  • R-Functions: also called commands, Functions perform tasks in R. They take in inputs called arguments and return outputs. You can either manually specify a function’s arguments or use the function’s default values.

 

GLOSSARY:

Wrapper functions: Functions that call other functions that do the real work, but provides a different or more convenient interface, syntax or structure.

  • R-packages: are collections of functions and data sets developed by the R-community contributors. They improve and expand the basic built-in R functions.Typically, a package will include code (not only R code!), documentation for the package and the functions inside, some tests to check everything works as it should, and data sets.

  • R-Libraries: Directory which stores R Packages

  • R-Package Repositories: Free online collections of R documents that organize R packages to distribute for R-users so that they can install packages using the install.packages command. CRAN and Bioconductor are examples of R repositories.

 


 

2.2.4 What is dartR?

dartR [@R-dartR] is an R package for loading, filtering and in general manipulating and analysing DArTseq markers. It uses the format from the package adegenet [@R-adegenet] to analyse SilicoDArT and SNPs. Please notice that many of the scripts are wrappers for other already available packages.

The updated guides for dartR will be provided to you in pdf format. You can also download them or read online.

Package dartR

Figure 2.2: Package dartR

 

Introduction to dartR

Figure 2.3: Introduction to dartR

 

Useful tutorials

Figure 2.4: Useful tutorials

 


 

SWEETPOTATO VIRTUAL WORKSHOP 2023

  • This workshop covers basic use of R and Rstudio as well as the package dartR to analize genetic diversity of sweetpotato using DArTseq markers.

  • We will be using today LOW DENSITY sweetpotato SilicoDArT markers.

  • For this workshop we will use the Zambia data for this demonstration however, the original file of 259 samples has been trimmed to 200 samples to make the demonstration easier, given that some samples had possible genotyping problems.

  • Please use the updated versions for R and Rstudio.

  • The specific versions used in this workshop are R Version 4.2.0 and Rstudio Version is 2022.02.3.

 

“THE WORKFLOW”

 

This is the workflow we will be following to produce a phylogenetic tree with SilicoDArT markers for Zambian samples.

 

 

“THE GOAL”

 

This is what we will produce:

 


 

Before starting with Rstudio

Before starting we need to organize DArTseq files and directories.

The directory for the project

It is important to know the directory where our files are, so we can use it for the workshop. The directory called SwpWorkshop2023 provided to everyone contains the following :

  • files: Containing DArTseq reports

  • plots: For the outputs that will be produced in this workshop

Csv files

SilicoDArT and SNP files must have the .csv extension (comma-separated values file). Please take note of the number of rows with asterisks (*) and how missing data is presented, in this case a hyphen (-) .

 

The following table represents how you will actually see your data if you open the csv data in Excel Spreadsheet

 

Original SilicoDArT.csv file.

Figure 2.5: Original SilicoDArT.csv file.

 

NOTE:

When sending DNA Diversity Arrays please use unique names so as to not confuse samples when doing analyses. In case you send repetitions use sequential suffixes similar to:

  • sample_a, sample_b, sample_c

  • sample_1, sample_2, sample_3

 

Preparing an Individual metafile

You will normally have to prepare a file in Excel (or similar) which is useful for Data analysis and subsequent annotation. The compulsory headers are “id” and “pop” (It’s important to used these specific names), for latitude and longitude information, please use “lat” and “lon”. You can add any other columns you consider useful.

Please use “Unknown” in this file for samples which population is unknown, and save it as csv.

 

NOTE:

For this particular dataset the “population” is considered the same as the “country”, but it can be a variety, a phenotype or any useful classification such as root color or breeder.

 

Individual metafile with population information, in this case the country of origin is used as population.

Figure 2.6: Individual metafile with population information, in this case the country of origin is used as population.

 

Individual metafile with additional information for Zambian samples.

Figure 2.7: Individual metafile with additional information for Zambian samples.

 

3 Starting with Rstudio

To open Rstudio please click on this icon:

3.1 Creating an Rstudio project

To create a project, go to the File menu, and click the option New Project.

or just click on this Icon:

Since we have already created a Directory, click on the “Existing Directory” option:

To set the working directory, please Browse for the Folder “SwpWorkshop2023”: And then click on “Create Project”.

 

3.1.1 Creating a new R Script

DO NOT WORRY!

You do not need to write any scripts, we provide these below. You will receive a copy for all documentation, including graphics produced.

To create an R Script, go to the File menu, click the option New File, and then click on R Script.

or just click on this Icon:

 

3.1.2 Installing packages

You’ll need to install several R-packages before starting the Workshop

The necessary R-packages can be installed running the following code:

# You can install the following packages with dependencies.

install.packages(c("devtools", "ggplot2", "gridExtra", "gtable", "label.switching",
    "tidyr", "dplyr"), dependencies = T)

# Note:if some problems arise when using the libraries ggplot or dplyr, please
# install 'tydiverse'

# From CRAN Repositories

install.packages("adegenet")
install.packages("ade4")
install.packages("poppr")
install.packages("ape")
install.packages("BiocManager")

# From BiocManager

BiocManager::install(c("SNPRelate", "qvalue"))

BiocManager::install("ggtree")
# Install dartR from CRAN

install.packages("dartR")

# Install dartR from github

library("devtools")

install_github(green-striped-gecko/dartR)

4 Exploratory analysis

 

 

 

4.0.1 Loading dartR Library:

To start manipulating the datasets we will need to load the library dartR. We can load any library going to the Packages tab in the Multipurpose pane, searching and selecting it. Here we will also find the descriptions and use for the packages contained in said library.

 

 

Nonetheless, we recommend that the library is included in the Script, this way we will not forget which libraries we are using.

#####------Loading dartR library------####

library("dartR")

 

NOTE:

Setting and changing the working directory:

The R-project was created in a specific Folder, so the directory should be the same, but it is possible to change it when necessary.

#####—-In which directory am I?—-####

getwd()

If necessary, please change your working directory.

#####——–Change directory——–####

setwd(“/Users/MyUser/Swp_workshop_2023”) #This is only an example, please use the path of the directory you are using

 

4.1 Quality analysis for SilicoDArT

The first step is to analyze the information we have received from Diversity Arrays as they are.

IMPORTANT NOTE:.

For dartR:

  • What we consider MARKERS are called LOCUS, and

  • What we consider SAMPLES are called INDIVIDUALS.

For this Workshop these expressions will be treated as synonyms.

 

4.1.1 Loading the SilicoDArT csv file:

GLOSSARY:

A genlight object:

An object of the class genlight, a class for storing in a compact way genotypes of binary SNPs. Although, genlight can be used for any level of ploidy.

It is necessary to load the file we are working with. The function gl.read.silicodart allows us to merged the original csv file received from Diversity Arrays and the Individual metafile we created with additional information.

SilicoDArT Input

  • filename: – name of csv file containing the SilicoDArT data

  • topskip: – number of rows to skip before the header row (containing the specimen identities)

  • nas: – missing data character

  • ind.metafile: – name of csv file containing metadata assigned to each entity (individual)

 

#####--------SilicoDArT Input--------####

gl_silico_swp <- gl.read.silicodart(filename ="files/SilicoDArT_Swp_736m_200ind.csv",
                                    nas = "-",
                                    topskip = 6,
                                    ind.metafile = "files/ind.metafile_736m_200ind.csv")
## Starting gl.read.silicodart 
##   Reading data from file: files/SilicoDArT_Swp_736m_200ind.csv 
##     This may take some time, please wait!
##     Added the following locus metrics:
## CloneID AlleleSequence TrimmedSequence Chrom_SweetPotato_NSP323_v3 ChromPosTag_SweetPotato_NSP323_v3 AlnCnt_SweetPotato_NSP323_v3 AlnEvalue_SweetPotato_NSP323_v3 Strand_SweetPotato_NSP323_v3 CallRate OneRatio PIC AvgReadDepth StDevReadDepth Qpmr Reproducibility .
##   Recognised: 200 individuals and 736 Sequence Tags using files/SilicoDArT_Swp_736m_200ind.csv 
##   Starting conversion to a genlight object ....
##     Please note conversion of bigger data sets will take some time!
##     Once finished, we recommend you save the object using gl.save(object, file="object.rdata")
##   Adding individual metadata: files/ind.metafile_736m_200ind.csv .
##   Ids for individual metadata (at least a subset of) are matching!
##   Found  200 matching ids out of 200 ids provided in the ind.metadata file. Subsetting loci now!.
##      Added pop factor.
##     Added latlon data.
##     Added  id  to the other$ind.metrics slot.
##      Added  pop  to the other$ind.metrics slot.
##      Added  lat  to the other$ind.metrics slot.
##      Added  lon  to the other$ind.metrics slot.
##      Added  INSTCODE  to the other$ind.metrics slot.
##      Added  ACCENUMB  to the other$ind.metrics slot.
##      Added  acc_name  to the other$ind.metrics slot.
##      Added  collection_site  to the other$ind.metrics slot.
##      Added  province  to the other$ind.metrics slot.
##      Added  elevation  to the other$ind.metrics slot.
##      Added  Plate  to the other$ind.metrics slot.
##      Added  Genotype  to the other$ind.metrics slot.
##   Genlight object created to hold Tag P/A data
## Completed: gl.read.silicodart

 

NOTE:

We will normally have to make sure CloneIDs are unique, but the gl.read.silicodart function does it automatically.

 

#It's always a good idea to save the genlight object, 
#so we can use it directly later.

save(gl_silico_swp, file="files/gl_silico_swp.rdata")
#Next time you need the file, you can just load the genlight object you saved 
#by using the function "load"

load("files/gl_silico_swp.rdata")

Producing the first report to analyze quality of data

 

4.1.2 Displaying names

#Display names of for the gl object

names(gl_silico_swp@other$loc.metrics)
##  [1] "CloneID"                           "AlleleSequence"                   
##  [3] "TrimmedSequence"                   "Chrom_SweetPotato_NSP323_v3"      
##  [5] "ChromPosTag_SweetPotato_NSP323_v3" "AlnCnt_SweetPotato_NSP323_v3"     
##  [7] "AlnEvalue_SweetPotato_NSP323_v3"   "Strand_SweetPotato_NSP323_v3"     
##  [9] "CallRate"                          "OneRatio"                         
## [11] "PIC"                               "AvgReadDepth"                     
## [13] "StDevReadDepth"                    "Qpmr"                             
## [15] "Reproducibility"
#Display first ten individual names

indNames(gl_silico_swp[1:10])
##  [1] "ZM9083"   "ZM9051"   "ZM8963_1" "ZM9080"   "ZM9046"   "ZM9099"  
##  [7] "ZM8966_1" "ZM9086"   "ZM9064"   "ZM8964_1"

 

4.2 Reports for data quality

We do not know about the quality of the markers we received from Diversity Arrays for this analysis. So we will produced reports to know.

 

4.2.1 Monomorphs

GLOSSARY:

Monomorphs:

Monomorphs are markers which are identical for all samples, therefore not informative.

####--------Monomorphs Report--------####

gl.report.monomorphs(gl_silico_swp)
## Starting gl.report.monomorphs 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Identifying monomorphic loci
## 
##   No. of loci: 736 
##     Polymorphic loci: 712 
##     Monomorphic loci: 24 
##     Loci scored all NA: 0 
##   No. of individuals: 200 
##   No. of populations: 1 
## 
## Completed: gl.report.monomorphs

 

4.2.2 CallRate

GLOSSARY:

CallRate:

CallRate is the proportion of samples for which the genotype call is either “1” or “0” , rather than “-”.

CallRate per marker

####---------CallRate Report---------####

gl.report.callrate(gl_silico_swp, method = "loc")
## Starting gl.report.callrate 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Reporting Call Rate by Locus
##   No. of loci = 736 
##   No. of individuals = 200 
##     Minimum      :  0.835 
##     1st quartile :  0.91 
##     Median       :  0.95 
##     Mean         :  0.9483424 
##     3r quartile  :  0.995 
##     Maximum      :  1 
##     Missing Rate Overall:  0.0517
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100%     1.000      143    19.4      593    80.6
## 2       95%     1.000      143    19.4      593    80.6
## 3       90%     1.000      143    19.4      593    80.6
## 4       85%     1.000      143    19.4      593    80.6
## 5       80%     0.995      186    25.3      550    74.7
## 6       75%     0.995      186    25.3      550    74.7
## 7       70%     0.990      225    30.6      511    69.4
## 8       65%     0.985      265    36.0      471    64.0
## 9       60%     0.980      300    40.8      436    59.2
## 10      55%     0.970      338    45.9      398    54.1
## 11      50%     0.950      372    50.5      364    49.5
## 12      45%     0.935      409    55.6      327    44.4
## 13      40%     0.925      459    62.4      277    37.6
## 14      35%     0.920      496    67.4      240    32.6
## 15      30%     0.915      523    71.1      213    28.9
## 16      25%     0.910      555    75.4      181    24.6
## 17      20%     0.905      592    80.4      144    19.6
## 18      15%     0.900      626    85.1      110    14.9
## 19      10%     0.890      681    92.5       55     7.5
## 20       5%     0.880      710    96.5       26     3.5
## 21       0%     0.835      736   100.0        0     0.0
## 
## 
## Completed: gl.report.callrate

The CallRate of the 736 markers is above 0.83. This is a good indication of quality!

CallRate per individual

####---------CallRate Report---------####

gl.report.callrate(gl_silico_swp, method = "ind")
## Starting gl.report.callrate 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Reporting Call Rate by Individual
##   No. of loci = 736 
##   No. of individuals = 200 
##     Minimum      :  0.8682065 
##     1st quartile :  0.9252717 
##     Median       :  0.9456522 
##     Mean         :  0.9483424 
##     3r quartile  :  0.9782609 
##     Maximum      :  0.9972826 
##     Missing Rate Overall:  0.0517
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100% 0.9972826        2     1.0      198    99.0
## 2       95% 0.9932745       10     5.0      190    95.0
## 3       90% 0.9891304       22    11.0      178    89.0
## 4       85% 0.9864130       31    15.5      169    84.5
## 5       80% 0.9823370       42    21.0      158    79.0
## 6       75% 0.9782609       52    26.0      148    74.0
## 7       70% 0.9705163       60    30.0      140    70.0
## 8       65% 0.9605978       74    37.0      126    63.0
## 9       60% 0.9557065       80    40.0      120    60.0
## 10      55% 0.9510870       91    45.5      109    54.5
## 11      50% 0.9456522      102    51.0       98    49.0
## 12      45% 0.9415761      111    55.5       89    44.5
## 13      40% 0.9355978      120    60.0       80    40.0
## 14      35% 0.9315897      130    65.0       70    35.0
## 15      30% 0.9279891      144    72.0       56    28.0
## 16      25% 0.9252717      152    76.0       48    24.0
## 17      20% 0.9211957      165    82.5       35    17.5
## 18      15% 0.9196332      170    85.0       30    15.0
## 19      10% 0.9130435      182    91.0       18     9.0
## 20       5% 0.9021739      192    96.0        8     4.0
## 21       0% 0.8682065      200   100.0        0     0.0
## 
## 
##  ind_name    pop      missing_data
##    ZM9218 Zambia 0.997282608695652
##    ZM8974 Zambia 0.997282608695652
##    ZM9083 Zambia 0.995923913043478
##    ZM9061 Zambia 0.995923913043478
##    ZM9175 Zambia 0.995923913043478
##    ZM9046 Zambia 0.994565217391304
##    ZM9049 Zambia 0.994565217391304
##    ZM9210 Zambia 0.994565217391304
##    ZM9211 Zambia 0.994565217391304
##    ZM9011 Zambia 0.994565217391304
##    ZM9063 Zambia  0.99320652173913
##    ZM9035 Zambia  0.99320652173913
##    ZM9002 Zambia 0.991847826086957
##    ZM9000 Zambia 0.990489130434783
##    ZM8972 Zambia 0.990489130434783
##    ZM9010 Zambia 0.990489130434783
##    ZM9064 Zambia 0.989130434782609
##    ZM9087 Zambia 0.989130434782609
##  ZM8965_1 Zambia 0.989130434782609
##    ZM9217 Zambia 0.989130434782609
##    ZM9006 Zambia 0.989130434782609
##    ZM8979 Zambia 0.989130434782609
##    ZM9080 Zambia 0.987771739130435
##    ZM9038 Zambia 0.987771739130435
##    ZM9047 Zambia 0.987771739130435
##    ZM9176 Zambia 0.987771739130435
##    ZM9003 Zambia 0.987771739130435
##    ZM8986 Zambia 0.987771739130435
##    ZM9055 Zambia 0.986413043478261
##    ZM9220 Zambia 0.986413043478261
##    ZM9171 Zambia 0.986413043478261
##    ZM9073 Zambia 0.985054347826087
##    ZM9037 Zambia 0.985054347826087
##    ZM9058 Zambia 0.985054347826087
##    ZM8980 Zambia 0.985054347826087
##    ZM8996 Zambia 0.985054347826087
##    ZM8995 Zambia 0.985054347826087
##    ZM9091 Zambia 0.983695652173913
##    ZM9036 Zambia 0.983695652173913
##  ZM8964_1 Zambia 0.982336956521739
##    ZM9170 Zambia 0.982336956521739
##    ZM9200 Zambia 0.982336956521739
##  ZM8966_1 Zambia 0.980978260869565
##    ZM9081 Zambia 0.980978260869565
##    ZM9060 Zambia 0.980978260869565
##    ZM8993 Zambia 0.980978260869565
##    ZM8998 Zambia 0.979619565217391
##    ZM9092 Zambia 0.978260869565217
##    ZM9216 Zambia 0.978260869565217
##    ZM9213 Zambia 0.978260869565217
##    ZM9205 Zambia 0.978260869565217
##    ZM9193 Zambia 0.978260869565217
##    ZM9051 Zambia  0.97554347826087
##    ZM9054 Zambia  0.97554347826087
##    ZM9056 Zambia  0.97554347826087
##    ZM9214 Zambia  0.97554347826087
##    ZM9042 Zambia 0.972826086956522
##    ZM9019 Zambia 0.971467391304348
##    ZM9065 Zambia 0.971467391304348
##  ZM8965_2 Zambia 0.971467391304348
##    ZM9040 Zambia 0.970108695652174
##    ZM9173 Zambia           0.96875
##    ZM9172 Zambia 0.967391304347826
##    ZM8970 Zambia 0.964673913043478
##    ZM8990 Zambia 0.963315217391304
##    ZM9174 Zambia  0.96195652173913
##    ZM8987 Zambia  0.96195652173913
##    ZM9013 Zambia  0.96195652173913
##    ZM9031 Zambia 0.960597826086957
##    ZM9198 Zambia 0.960597826086957
##    ZM9004 Zambia 0.960597826086957
##    ZM8981 Zambia 0.960597826086957
##    ZM9009 Zambia 0.960597826086957
##  ZM8963_2 Zambia 0.960597826086957
##    ZM8967 Zambia 0.959239130434783
##  ZM8964_2 Zambia 0.959239130434783
##    ZM9209 Zambia 0.957880434782609
##    ZM9076 Zambia 0.956521739130435
##    ZM9222 Zambia 0.956521739130435
##    ZM8984 Zambia 0.956521739130435
##    ZM9165 Zambia 0.955163043478261
##    ZM9219 Zambia 0.955163043478261
##    ZM8982 Zambia 0.955163043478261
##    ZM9007 Zambia 0.955163043478261
##    ZM9027 Zambia 0.953804347826087
##    ZM9077 Zambia 0.953804347826087
##    ZM8999 Zambia 0.953804347826087
##    ZM9029 Zambia 0.952445652173913
##    ZM8978 Zambia 0.952445652173913
##    ZM9164 Zambia 0.951086956521739
##    ZM8992 Zambia 0.951086956521739
##    ZM9086 Zambia 0.949728260869565
##    ZM9106 Zambia 0.949728260869565
##    ZM9157 Zambia 0.948369565217391
##    ZM9075 Zambia 0.948369565217391
##    ZM9085 Zambia 0.947010869565217
##    ZM9088 Zambia 0.947010869565217
##    ZM9001 Zambia 0.947010869565217
##    ZM9196 Zambia 0.945652173913043
##    ZM9184 Zambia 0.945652173913043
##    ZM9207 Zambia 0.945652173913043
##    ZM8968 Zambia 0.945652173913043
##    ZM9067 Zambia  0.94429347826087
##    ZM8973 Zambia  0.94429347826087
##    ZM9208 Zambia 0.942934782608696
##    ZM9180 Zambia 0.942934782608696
##    ZM9138 Zambia 0.941576086956522
##    ZM9148 Zambia 0.941576086956522
##    ZM9166 Zambia 0.941576086956522
##    ZM8994 Zambia 0.941576086956522
##    ZM8985 Zambia 0.941576086956522
##    ZM9185 Zambia 0.940217391304348
##  ZM8966_2 Zambia 0.940217391304348
##    ZM9014 Zambia 0.938858695652174
##    ZM9195 Zambia 0.938858695652174
##    ZM9016 Zambia            0.9375
##    ZM9107 Zambia            0.9375
##    ZM8975 Zambia            0.9375
##    ZM9070 Zambia 0.936141304347826
##    ZM9044 Zambia 0.936141304347826
##    ZM9071 Zambia 0.934782608695652
##    ZM9144 Zambia 0.934782608695652
##    ZM9168 Zambia 0.934782608695652
##    ZM9169 Zambia 0.934782608695652
##    ZM9215 Zambia 0.934782608695652
##    ZM8971 Zambia 0.934782608695652
##    ZM9201 Zambia 0.933423913043478
##    ZM9206 Zambia 0.933423913043478
##    ZM9128 Zambia 0.932065217391304
##    ZM9189 Zambia 0.932065217391304
##    ZM9099 Zambia  0.93070652173913
##    ZM9121 Zambia  0.93070652173913
##    ZM9162 Zambia  0.93070652173913
##    ZM9181 Zambia  0.93070652173913
##    ZM9177 Zambia  0.93070652173913
##    ZM9005 Zambia  0.93070652173913
##    ZM9161 Zambia 0.929347826086957
##    ZM9192 Zambia 0.929347826086957
##    ZM9023 Zambia 0.927989130434783
##    ZM9131 Zambia 0.927989130434783
##    ZM9059 Zambia 0.927989130434783
##    ZM9066 Zambia 0.927989130434783
##    ZM9182 Zambia 0.927989130434783
##    ZM8988 Zambia 0.927989130434783
##    ZM9145 Zambia 0.926630434782609
##    ZM8991 Zambia 0.926630434782609
##    ZM8977 Zambia 0.926630434782609
##    ZM8997 Zambia 0.926630434782609
##    ZM9043 Zambia 0.925271739130435
##    ZM9134 Zambia 0.925271739130435
##    ZM9190 Zambia 0.925271739130435
##    ZM9084 Zambia 0.925271739130435
##    ZM9020 Zambia 0.923913043478261
##    ZM9199 Zambia 0.923913043478261
##    ZM9221 Zambia 0.923913043478261
##    ZM9012 Zambia 0.923913043478261
##    ZM9187 Zambia 0.922554347826087
##    ZM9097 Zambia 0.921195652173913
##    ZM9041 Zambia 0.921195652173913
##    ZM9079 Zambia 0.921195652173913
##    ZM9057 Zambia 0.921195652173913
##    ZM9204 Zambia 0.921195652173913
##    ZM9033 Zambia 0.921195652173913
##    ZM8983 Zambia 0.921195652173913
##    ZM8976 Zambia 0.921195652173913
##    ZM9109 Zambia 0.919836956521739
##    ZM9048 Zambia 0.919836956521739
##    ZM9118 Zambia 0.919836956521739
##    ZM9039 Zambia 0.919836956521739
##    ZM9212 Zambia 0.919836956521739
##    ZM9117 Zambia 0.918478260869565
##    ZM9158 Zambia 0.918478260869565
##    ZM9034 Zambia 0.918478260869565
##    ZM9115 Zambia 0.915760869565217
##    ZM9188 Zambia 0.915760869565217
##    ZM9017 Zambia 0.914402173913043
##    ZM9053 Zambia 0.914402173913043
##    ZM9194 Zambia 0.914402173913043
##    ZM9135 Zambia  0.91304347826087
##    ZM9130 Zambia  0.91304347826087
##    ZM9156 Zambia  0.91304347826087
##    ZM9110 Zambia  0.91304347826087
##    ZM9101 Zambia 0.911684782608696
##    ZM9114 Zambia 0.908967391304348
##    ZM9069 Zambia 0.908967391304348
##  ZM8963_1 Zambia 0.907608695652174
##    ZM9155 Zambia           0.90625
##    ZM9146 Zambia 0.904891304347826
##    ZM9021 Zambia 0.902173913043478
##    ZM9103 Zambia 0.902173913043478
##    ZM9149 Zambia 0.902173913043478
##    ZM9178 Zambia 0.902173913043478
##    ZM9147 Zambia 0.894021739130435
##    ZM9153 Zambia 0.892663043478261
##    ZM9098 Zambia 0.892663043478261
##    ZM9163 Zambia 0.892663043478261
##    ZM9151 Zambia 0.891304347826087
##    ZM9191 Zambia 0.877717391304348
##    ZM9089 Zambia 0.876358695652174
##    ZM9167 Zambia  0.86820652173913
## Completed: gl.report.callrate

 

The CallRate of the 200 individuals is above 0.86. This is also a good indication of quality!

 

4.2.3 Reproducibility

GLOSSARY:

Reproducibility:

Reproducibility is the proportion of technical replicate assay pairs for which the marker score is consistent.

####---------Reproducibility Report---------####

gl.report.reproducibility(gl_silico_swp)
## Starting gl.report.reproducibility 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Reporting Repeatability by Locus
##   No. of loci = 736 
##   No. of individuals = 200 
##     Minimum      :  0.95 
##     1st quartile :  0.9810412 
##     Median       :  1 
##     Mean         :  0.989915 
##     3r quartile  :  1 
##     Maximum      :  1 
##     Missing Rate Overall:  0.05
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100%  1.000000      504    68.5      232    31.5
## 2       95%  1.000000      504    68.5      232    31.5
## 3       90%  1.000000      504    68.5      232    31.5
## 4       85%  1.000000      504    68.5      232    31.5
## 5       80%  1.000000      504    68.5      232    31.5
## 6       75%  1.000000      504    68.5      232    31.5
## 7       70%  1.000000      504    68.5      232    31.5
## 8       65%  1.000000      504    68.5      232    31.5
## 9       60%  1.000000      504    68.5      232    31.5
## 10      55%  1.000000      504    68.5      232    31.5
## 11      50%  1.000000      504    68.5      232    31.5
## 12      45%  1.000000      504    68.5      232    31.5
## 13      40%  1.000000      504    68.5      232    31.5
## 14      35%  1.000000      504    68.5      232    31.5
## 15      30%  0.982456      519    70.5      217    29.5
## 16      25%  0.980769      557    75.7      179    24.3
## 17      20%  0.965517      594    80.7      142    19.3
## 18      15%  0.964286      641    87.1       95    12.9
## 19      10%  0.962963      670    91.0       66     9.0
## 20       5%  0.958333      702    95.4       34     4.6
## 21       0%  0.950000      736   100.0        0     0.0
## Completed: gl.report.reproducibility

NOTE:

Why is Reproducibility important?

Reproducibility gives accuracy in the analysis, allowing you to obtain the same result everytime you run it.

Now that we have produced these Reports, we know more about the quality of our SilicoDArT markers

 


 

 

 

4.3 Creating objects

  • While it may not make sense right now, keeping the original genlight object untouched is always a good idea.

  • It’s also a good idea to keep every other genlight object modified by filters, this way, if we made a mistake we can just go back a single step and not do everything again.

####----change to gl2 object---####

gl2 <- gl_silico_swp

#You can see your objects in the "Environment panel"
Environment panel

Figure 4.1: Environment panel

5 Filtering markers

 

5.1 Filtering monomorphs

#Taking away non-informative markers

gl3 <- gl.filter.monomorphs(gl2, v=0)

#How many markers are left after taken away the monomorphs?

nLoc(gl3)
## [1] 712

 

5.2 Filtering by reproducibility

#Since reproducibility is very high for this particular dataset,
#we can go ahead and used the threshold=0.96

gl4 <- gl.filter.reproducibility(gl3, threshold=0.96)
## Starting gl.filter.reproducibility 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Removing loci with repeatability less than 0.96

## Completed: gl.filter.reproducibility

#How many markers are left after filtering reproducibility=0.96?

nLoc(gl4)
## [1] 665

 

5.3 Filtering by callrate

gl5 <- gl.filter.callrate(gl4, method = "loc", threshold = 0.90, recalc=TRUE)
## Starting gl.filter.callrate 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Recalculating Call Rate
##   Removing loci based on Call Rate, threshold = 0.9 
## Starting gl.recalc.metrics 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
## Starting utils.recalc.avgpic 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Recalculating OneRatio, PIC
## Completed: utils.recalc.avgpic 
## Starting utils.recalc.callrate 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Recalculating locus metric CallRate
## Completed: utils.recalc.callrate 
##   Locus metrics recalculated
## Completed: gl.recalc.metrics

## Completed: gl.filter.callrate

#How many markers are left after filtering callrate=0.90?

nLoc(gl5)
## [1] 575

Although we are reducing the number of markers by 22% by filtering, this improves the quality of data.

 

NOTE:

When the dataset is too large, it is necessary to add the option “plot=FALSE” , because it will take to much time to run the plot or it could cause an Error.

gl5 <- gl.filter.callrate(gl4, method = “loc”, threshold = 0.90 , plot = FALSE)

THIS IS AN EXAMPLE, DO NOT RUN THIS TIME

 

5.4 Saving filtered data

As well as saving the first genlight object produced. Saving the filtered genlight object will prove useful in the future.

#Save gl object with filtered data

save(gl5, file="files/gl5_575markers_200ind.rdata")
#Next time you need the gl object with filtered data,
#you can just load the genlight object you saved by using the function "load"

load("files/gl5_575markers_200ind.rdata")

 

5.5 Geographic origin

Since we are analyzing a single Country, we will use another Administrative subdivision, in this case “Province”.

library(maps)
map_Zambia <- borders("world", 
                      regions = c("Zambia","adm1"),
                      fill = "transparent", colour = "black")

points_Zambia <- read.csv("files/ind.metafile_736m_200ind.csv")

library(ggplot2)
#Map using ggplot and automatic colors
ggplot() + map_Zambia +
  geom_point(data = points_Zambia, 
             aes(x = lon, y = lat, colour = province)) + # Plot points of location of the samples 
  scale_color_discrete("Province") +  # Change legend title
  xlab("Longitude") +  # X-axis label
  ylab("Latitude")+ # Y-axis label
  theme_light()+
  coord_fixed(ratio = 1)
Geographic origin of the samples

Figure 5.1: Geographic origin of the samples

 

6 Genetic distance and NJ tree

To visualize genetic relationships, similarity and distance we use a “tree”

 

6.1 With a dartR function

dartR has its own functions to calculate genetic distances. This genetic distance is an estimation of how close two individuals (or samples) are genetically.

  • The function gl.dist.pop will calculate the genetic distance between population “pop”.

  • The function gl.dist.ind will calculate the genetic distance between individuals “id”

6.1.1 Calculate the genetic distance

Genetic distance used: Jaccard distance

Jaccard_silicoDArT_200ind <- gl.dist.ind(gl5, method = "jaccard")
## Starting gl.dist.ind 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Calculating distances based on the Jaccard Coefficient

##   Returning a stat::dist object
## Completed: gl.dist.ind

6.1.2 Clustering

Clustering method used: Neighbor joining

Originally written by Saitou and Nei in 1987, they stated:

“The principle of this method is to find pairs of operational taxonomic units (OTUs) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree”

####----Load library to use---####

library(ape)

####----Solve clustering---####

tree_Silico_Swp <- nj(Jaccard_silicoDArT_200ind)

6.1.3 Plot tree

####----Load library to use---####

library(ggtree)
####----Plot basic tree---####

ggtree(tree_Silico_Swp)+
  geom_tiplab(aes(label=label), color="black",size=1.5)
## Warning: The tree contained negative edge lengths. If you want to ignore the edges,
## you can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.

 

6.1.4 Save plot

To save the last plot produced you can use the following code:

ggsave("plots/Dendrogram_Zambia_575m_200ind.png", width = 8, height = 20)

# width and height can be changed to fit the size and appearence you want

6.1.5 Problems found in the dendrogram

Warning message of negative edge lengths

Figure 6.1: Warning message of negative edge lengths

 

Parts of the dendrogram with negative edge lengths

Figure 6.2: Parts of the dendrogram with negative edge lengths

 

Suspicious individuals at the base of the dendrogram

Figure 6.3: Suspicious individuals at the base of the dendrogram

 

6.1.6 Erasing problematic samples

For the moment we need to erase some samples for this analysis, since they need more work.

It is necessary to create a file with the samples we want to eliminate

File containing the list of samples to drop

Figure 6.4: File containing the list of samples to drop

We need to read the file and use the function gl.drop.ind

samples_to_drop <- readLines("files/samples_to_drop.csv")

gl6 <- gl.drop.ind(gl5, ind.list = samples_to_drop, recalc = T)
## Starting gl.drop.ind 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Deleting specified individuals
##   Warning: Resultant dataset may contain monomorphic loci
##   Recalculating locus metrics
## Completed: gl.drop.ind

6.1.7 Imputing missing data

gl6_imputed <- gl.impute(gl6)
## Starting gl.impute 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Imputation based on drawing from the nearest neighbour
##   Residual missing values were filled randomly drawing from the global allele profiles by locus
## Completed: gl.impute

6.1.8 Repeat the process to plot a tree

####----Calculate distance---####

Jaccard_silicoDArT_imputed <- gl.dist.ind(gl6_imputed, method = "jaccard")
## Starting gl.dist.ind 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Calculating distances based on the Jaccard Coefficient
##   Returning a stat::dist object
## Completed: gl.dist.ind
####----Solve clustering---####

tree_Silico_Swp_imputed <- nj(Jaccard_silicoDArT_imputed)

####----Plot imputed data tree---####

#options(ignore.negative.edge=TRUE) #use this line to ignore negative edges

ggtree(tree_Silico_Swp_imputed)+
  geom_tiplab(aes(label=label), color="black",size=1.5)

6.1.9 Annotating the tree

We will plot a Neighbor-Joining tree with the package ggtree^ [@R-ggtree] based on ggplot2 [@R-ggplot2]

GLOSSARY:

Annotate a tree:

Annotating for phylogenetics is basically give color, format and style in general like colouring branches or highlighting clades.

#Loading the libraries ggtree and ape

library(ggtree)

library(ape)

library(dplyr)
#To annotate the tree, we need to load again our ind.metafile. 
#You can add any additional information you want. 
#We will storage this information as "info".

info <- read.csv("files/ind.metafile_736m_200ind.csv")
#Plotting the NJ tree with ggtree


ggtree(tree_Silico_Swp_imputed, layout="rectangular")%<+% info %>% +
  geom_point(size=1, aes(color=province))+
  geom_tree(aes(color=province))+
  geom_tiplab(aes(label=label), color="black",size=1.5)+
  scale_color_manual(values = c("#219C90",
                                "#B931FC", 
                                "#EE9322", 
                                "#D83F31"),
                    labels = c('Eastern province', 
                               'Southern province', 
                               'North-western province', 
                               "Copperbelt province"))

ggsave("plots/Dendrogram_Zambia_575m_155ind_colored.png", width = 8, height = 20)
  
ggtree(tree_Silico_Swp_imputed, layout="circular")%<+% info %>% +
  geom_point(size=1, aes(color=province))+
  geom_tree(aes(color=province))+
  geom_tiplab(aes(label=label), color="black",size=1.5)+
  scale_color_manual(values = c("#219C90",
                                "#B931FC", 
                                "#EE9322", 
                                "#D83F31"),
                    labels = c('Eastern province', 
                               'Southern province', 
                               'North-western province', 
                               "Copperbelt province"))

ggsave("plots/Dendrogram_Zambia_575m_155ind_colored_circular.png", width = 20, height = 20)

 

 

[More about ggtree and all its options? Go to Chapter 4.2 Visualization and annotation of phylogenetic trees:ggtree (https://guangchuangyu.github.io/ggtree-book/chapter-ggtree.html)

 

NOTE:

You can actually calculate the distance between populations using the gl.dis.pop function.

Since this dataset is only made by samples from Zambia, the pop can be redefine by some other information, such as “province”.

####----Redefine the population information---####

gl7 <- gl6_imputed

pop(gl7) <- gl7@other$ind.metrics$province

####----Euclidean distances between populations---####

Jaccard_silicoDArT_pop <- gl.dist.pop(gl7, method = "euclidean")

 


 

 

BEYOND THIS WORKSHOP

The following chapters are not covered in the Workshop but are useful documentation to keep developing your abilities in molecular data analysis with DArtseq markers. We strongly recommend that you follow the instructions for the workshop with SilicoDArT first.

 

7 Analysis for SNPs

The SNP analysis process is very similar to the one we followed for SilicoDArT.

SNP markers

Single nucleotide polymorphisms (SNPs) are co-dominant markers that represent variation in specific nucleotides positions. DArTSeq sequencing technology allows for simultaneous SNP discovery and characterization within a large number of samples. They are presented as biallelic (only two possible allele variants).

Diversity Arrays provides two different datasets for SNPs.

  • SNPs (Single Nucleotide Polymorphism)
  • One-Row format
  • Two-Row format

 

7.0.1 Two-Row SNPs dataset (0,1)

Here, each of the two rows correspond to one of the two alleles. These two alleles are named as Reference allele and SNP allele (or alternative allele)

  • |1|1| Homozygous (AA)
  • |0|0| Homozygous (BB)
  • |1|0| or |0|1| Heterozygous (AB)
  • |-|-| Missing data

 

 


THE TWO-ROW SNPs REPORT

 

The following is a colored example of the report for SNPs and its components, opened in an Excel spreadsheet.

The report includes columns with the following information:

  • In blue are columns with marker IDs and sequence,
  • In yellow are columns with mapping position and scores,
  • In green are columns with markers stats (CallRate, Frequencies, RepAvg, etc).

 

Columns with markers stats (The green columns)

 

 

Biallelic markers /Missing-data

For SNPs, additional to CloneID information, we have an AlleleID code. This contains information about specific single nucleotide variation inside the CloneID.

 

 


 

7.0.2 One-Row SNPs dataset (0,1,2)

SNP markers are also presented in One-Row format

  • 1 Homozygous (AA)
  • 0 Homozygous (BB)
  • 2 Heterozygous (AB)
  • - Missing data

 

 

7.1 Quality analysis for SNPs

 

7.1.1 Loading the SNP csv file:

Based on our experience working with SilicoDArT data we will work with 155 samples to be able tocompare the results between SNPs and SilicoDArTs

 

#####--------SNPs Input--------####

gl_snp_sp <- gl.read.dart(filename="files/SNPs_Swp_6037markers_155ind_2Row.csv",
                          nas = "-",
                          ind.metafile="files/ind.metafile_6037markers_155ind.csv")
## Starting gl.read.dart 
## Starting utils.read.dart 
##   Topskip not provided.
##  Setting topskip to 6 .
##   Reading in the SNP data
##   Detected 2 row format.
## Number of rows per clone (should be only  2 s): 2 
##  Added the following locus metrics:
## AlleleID CloneID AlleleSequence TrimmedSequence Chrom_SweetPotato_NSP323_v3 ChromPosTag_SweetPotato_NSP323_v3 ChromPosSnp_SweetPotato_NSP323_v3 AlnCnt_SweetPotato_NSP323_v3 AlnEvalue_SweetPotato_NSP323_v3 Strand_SweetPotato_NSP323_v3 SNP SnpPosition CallRate OneRatioRef OneRatioSnp FreqHomRef FreqHomSnp FreqHets PICRef PICSnp AvgPIC AvgCountRef AvgCountSnp RepAvg .
## Recognised: 155 individuals and 6037  SNPs in a 2 row format using files/SNPs_Swp_6037markers_155ind_2Row.csv 
## Completed: utils.read.dart 
## Starting utils.dart2genlight 
## Starting conversion....
## Format is 2 rows.
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using save(object, file="object.rdata")
## Adding individual metrics: files/ind.metafile_6037markers_155ind.csv .
##   Ids for individual metadata (at least a subset of) are matching!
##   Found  155 matching ids out of 155 ids provided in the ind.metadata file.
##   Added population assignments.
##   Added latlon data.
##  Added  id  to the other$ind.metrics slot.
##   Added  pop  to the other$ind.metrics slot.
##   Added  num_in_file  to the other$ind.metrics slot.
##   Added  lat  to the other$ind.metrics slot.
##   Added  lon  to the other$ind.metrics slot.
##   Added  INSTCODE  to the other$ind.metrics slot.
##   Added  ACCENUMB  to the other$ind.metrics slot.
##   Added  acc_name  to the other$ind.metrics slot.
##   Added  collection_site  to the other$ind.metrics slot.
##   Added  province  to the other$ind.metrics slot.
##   Added  elevation  to the other$ind.metrics slot.
##   Added  eval  to the other$ind.metrics slot.
##   Added  copy  to the other$ind.metrics slot.
##   Added  duplicates  to the other$ind.metrics slot.
##   Added  Cluster6  to the other$ind.metrics slot.
##   Added  Cluster8  to the other$ind.metrics slot.
##   Added  pop2  to the other$ind.metrics slot.
##   Added  pop3  to the other$ind.metrics slot.
##   Added  Plate  to the other$ind.metrics slot.
##   Added  Genotype  to the other$ind.metrics slot.
## Completed: utils.dart2genlight 
##  Data read in. Please check carefully the output above
##   Read depth calculated and added to the locus metrics
##   Minor Allele Frequency (MAF) calculated and added to the locus metrics
##   Recalculating locus metrics provided by DArT (optionally specified)
## Completed: gl.read.dart
#To keep the genlight object with 155 individuals untouched.

gl_snp1 <- gl_snp_sp 

7.2 Reports for quality for SNPs

####----- Reports for data quality-----####

# Report for Monomorphs

gl.report.monomorphs(gl_snp1)
## Starting gl.report.monomorphs 
##   Processing genlight object with SNP data
##   Identifying monomorphic loci
## 
##   No. of loci: 6037 
##     Polymorphic loci: 5240 
##     Monomorphic loci: 797 
##     Loci scored all NA: 0 
##   No. of individuals: 155 
##   No. of populations: 1 
## 
## Completed: gl.report.monomorphs
# Report for Reproducibility

gl.report.reproducibility(gl_snp1)
## Starting gl.report.reproducibility 
##   Processing genlight object with SNP data
##   Reporting Repeatability by Locus
##   No. of loci = 6037 
##   No. of individuals = 155 
##     Minimum      :  0.86 
##     1st quartile :  0.944444 
##     Median       :  0.984375 
##     Mean         :  0.9698738 
##     3r quartile  :  1 
##     Maximum      :  1 
##     Missing Rate Overall:  0.3
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100%  1.000000     2906    48.1     3131    51.9
## 2       95%  1.000000     2906    48.1     3131    51.9
## 3       90%  1.000000     2906    48.1     3131    51.9
## 4       85%  1.000000     2906    48.1     3131    51.9
## 5       80%  1.000000     2906    48.1     3131    51.9
## 6       75%  1.000000     2906    48.1     3131    51.9
## 7       70%  1.000000     2906    48.1     3131    51.9
## 8       65%  1.000000     2906    48.1     3131    51.9
## 9       60%  1.000000     2906    48.1     3131    51.9
## 10      55%  1.000000     2906    48.1     3131    51.9
## 11      50%  0.984375     3025    50.1     3012    49.9
## 12      45%  0.975000     3327    55.1     2710    44.9
## 13      40%  0.967391     3626    60.1     2411    39.9
## 14      35%  0.960000     3927    65.0     2110    35.0
## 15      30%  0.952381     4239    70.2     1798    29.8
## 16      25%  0.944444     4553    75.4     1484    24.6
## 17      20%  0.936170     4832    80.0     1205    20.0
## 18      15%  0.926829     5136    85.1      901    14.9
## 19      10%  0.914894     5438    90.1      599     9.9
## 20       5%  0.897727     5738    95.0      299     5.0
## 21       0%  0.860000     6037   100.0        0     0.0
## Completed: gl.report.reproducibility

# Report for CallRate

gl.report.callrate(gl_snp1, method="loc")
## Starting gl.report.callrate 
##   Processing genlight object with SNP data
##   Reporting Call Rate by Locus
##   No. of loci = 6037 
##   No. of individuals = 155 
##     Minimum      :  0.154839 
##     1st quartile :  0.548387 
##     Median       :  0.735484 
##     Mean         :  0.7047668 
##     3r quartile  :  0.883871 
##     Maximum      :  1 
##     Missing Rate Overall:  0.2952
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100%  1.000000      208     3.4     5829    96.6
## 2       95%  0.993548      340     5.6     5697    94.4
## 3       90%  0.974194      629    10.4     5408    89.6
## 4       85%  0.948387      946    15.7     5091    84.3
## 5       80%  0.916129     1275    21.1     4762    78.9
## 6       75%  0.883871     1555    25.8     4482    74.2
## 7       70%  0.858065     1817    30.1     4220    69.9
## 8       65%  0.825806     2140    35.4     3897    64.6
## 9       60%  0.793548     2442    40.5     3595    59.5
## 10      55%  0.761290     2779    46.0     3258    54.0
## 11      50%  0.735484     3027    50.1     3010    49.9
## 12      45%  0.703226     3325    55.1     2712    44.9
## 13      40%  0.664516     3680    61.0     2357    39.0
## 14      35%  0.632258     3936    65.2     2101    34.8
## 15      30%  0.593548     4248    70.4     1789    29.6
## 16      25%  0.548387     4559    75.5     1478    24.5
## 17      20%  0.496774     4860    80.5     1177    19.5
## 18      15%  0.445161     5142    85.2      895    14.8
## 19      10%  0.380645     5446    90.2      591     9.8
## 20       5%  0.316129     5753    95.3      284     4.7
## 21       0%  0.154839     6037   100.0        0     0.0
## 
## 
## Completed: gl.report.callrate

7.3 Filtering markers for SNPs

gl.report.callrate(gl_snp1, method="ind")
## Starting gl.report.callrate 
##   Processing genlight object with SNP data
##   Reporting Call Rate by Individual
##   No. of loci = 6037 
##   No. of individuals = 155 
##     Minimum      :  0.601292 
##     1st quartile :  0.6584396 
##     Median       :  0.6983601 
##     Mean         :  0.7047668 
##     3r quartile  :  0.7382806 
##     Maximum      :  0.9011098 
##     Missing Rate Overall:  0.2952
##    Quantile Threshold Retained Percent Filtered Percent
## 1      100% 0.9011098        1     0.6      154    99.4
## 2       95% 0.8255591        8     5.2      147    94.8
## 3       90% 0.7907239       16    10.3      139    89.7
## 4       85% 0.7731655       24    15.5      131    84.5
## 5       80% 0.7533212       31    20.0      124    80.0
## 6       75% 0.7382806       39    25.2      116    74.8
## 7       70% 0.7290045       47    30.3      108    69.7
## 8       65% 0.7215007       54    34.8      101    65.2
## 9       60% 0.7131025       62    40.0       93    60.0
## 10      55% 0.7047871       70    45.2       85    54.8
## 11      50% 0.6983601       78    50.3       77    49.7
## 12      45% 0.6926619       85    54.8       70    45.2
## 13      40% 0.6882889       93    60.0       62    40.0
## 14      35% 0.6775219      101    65.2       54    34.8
## 15      30% 0.6663575      108    69.7       47    30.3
## 16      25% 0.6584396      116    74.8       39    25.2
## 17      20% 0.6454862      124    80.0       31    20.0
## 18      15% 0.6367732      131    84.5       24    15.5
## 19      10% 0.6225609      139    89.7       16    10.3
## 20       5% 0.6077688      147    94.8        8     5.2
## 21       0% 0.6012920      155   100.0        0     0.0
## 
## 
##  ind_name    pop      missing_data
##    ZM8972 Zambia 0.901109822759649
##    ZM8998 Zambia 0.895477886367401
##    ZM8995 Zambia 0.881398045386782
##    ZM8993 Zambia 0.853569653801557
##    ZM8970 Zambia 0.841974490641047
##    ZM8980 Zambia 0.834851747556733
##    ZM9171 Zambia   0.8295511015405
##    ZM8986 Zambia 0.827066423720391
##    ZM8974 Zambia 0.824913036276296
##    ZM9006 Zambia 0.813980453867815
##    ZM8984 Zambia 0.813814808679808
##    ZM8996 Zambia 0.809342388603611
##    ZM8979 Zambia 0.807520291535531
##    ZM9012 Zambia 0.802219645519298
##    ZM8985 Zambia 0.794103031306941
##    ZM8981 Zambia 0.790790127546795
##    ZM9007 Zambia 0.790624482358787
##    ZM8999 Zambia 0.782010932582409
##    ZM9011 Zambia 0.781845287394401
##    ZM9002 Zambia 0.779526254762299
##    ZM8977 Zambia 0.775385125062117
##    ZM9003 Zambia 0.775053834686102
##    ZM9055 Zambia 0.774059963558059
##    ZM9051 Zambia 0.773231737618022
##    ZM8994 Zambia 0.772569156865993
##    ZM9193 Zambia 0.770250124233891
##    ZM9046 Zambia 0.769421898293855
##    ZM9000 Zambia  0.76478383302965
##    ZM9200 Zambia 0.763955607089614
##    ZM9035 Zambia 0.759483187013417
##    ZM9192 Zambia 0.756501573629286
##    ZM9010 Zambia 0.752526089117111
##    ZM9019 Zambia 0.752029153553089
##    ZM9073 Zambia 0.751532217989067
##    ZM9060 Zambia 0.751035282425046
##    ZM9134 Zambia 0.751035282425046
##    ZM9065 Zambia 0.750703992049031
##    ZM9170 Zambia 0.744740765280769
##    ZM9168 Zambia 0.739440119264535
##    ZM8991 Zambia 0.737121086632433
##    ZM9163 Zambia 0.736292860692397
##    ZM9187 Zambia 0.732317376180222
##    ZM8982 Zambia 0.732317376180222
##    ZM9165 Zambia 0.729667053172105
##    ZM9001 Zambia 0.729667053172105
##    ZM9009 Zambia 0.729335762796091
##    ZM9013 Zambia 0.729170117608083
##    ZM9211 Zambia 0.728341891668047
##    ZM8975 Zambia 0.727182375351996
##  ZM8963_2 Zambia 0.726685439787974
##    ZM9092 Zambia 0.726519794599967
##  ZM8965_2 Zambia 0.726188504223952
##    ZM9204 Zambia 0.725194633095908
##    ZM9220 Zambia 0.722544310087792
##    ZM9161 Zambia 0.721384793771741
##    ZM8978 Zambia 0.720887858207719
##    ZM9004 Zambia 0.720722213019712
##    ZM9205 Zambia  0.72022527745569
##    ZM9155 Zambia 0.718734470763624
##    ZM9166 Zambia 0.717906244823588
##    ZM9005 Zambia 0.717078018883551
##    ZM9196 Zambia 0.716084147755508
##    ZM8990 Zambia 0.711114792115289
##    ZM9147 Zambia 0.710783501739275
##    ZM9207 Zambia 0.710783501739275
##    ZM9191 Zambia 0.709458340235216
##    ZM8987 Zambia 0.709292695047209
##    ZM9053 Zambia 0.705482855723041
##    ZM9214 Zambia 0.705151565347027
##    ZM9158 Zambia 0.704985920159019
##    ZM9209 Zambia  0.70432333940699
##    ZM9169 Zambia 0.703826403842968
##    ZM9185 Zambia 0.703660758654961
##  ZM8964_2 Zambia 0.703660758654961
##    ZM9085 Zambia 0.703495113466954
##    ZM9199 Zambia 0.703495113466954
##    ZM9181 Zambia 0.701010435646845
##    ZM9162 Zambia 0.698360112638728
##  ZM8966_2 Zambia 0.697863177074706
##    ZM9029 Zambia 0.696538015570648
##    ZM9110 Zambia 0.696206725194633
##    ZM9215 Zambia 0.695709789630611
##    ZM9067 Zambia 0.695047208878582
##    ZM9167 Zambia  0.69455027331456
##    ZM9201 Zambia 0.692893821434487
##    ZM9195 Zambia 0.692562531058473
##    ZM9115 Zambia 0.692231240682458
##    ZM9037 Zambia 0.692065595494451
##    ZM9106 Zambia 0.691568659930429
##    ZM8971 Zambia 0.690243498426371
##    ZM9038 Zambia 0.690077853238363
##    ZM8973 Zambia 0.689912208050356
##    ZM8983 Zambia 0.688752691734305
##  ZM8965_1 Zambia 0.687593175418254
##    ZM9208 Zambia  0.68726188504224
##    ZM8992 Zambia 0.686764949478218
##    ZM9206 Zambia 0.686102368726188
##    ZM9103 Zambia 0.683286400530065
##    ZM9027 Zambia 0.682789464966043
##    ZM9213 Zambia 0.679476561205897
##    ZM9146 Zambia 0.677820109325824
##    ZM8988 Zambia 0.674838495941693
##    ZM8968 Zambia 0.672188172933576
##    ZM9182 Zambia 0.670366075865496
##  ZM8964_1 Zambia 0.670200430677489
##    ZM9036 Zambia 0.669869140301474
##    ZM9091 Zambia 0.669040914361438
##    ZM9117 Zambia 0.666887526917343
##    ZM9177 Zambia 0.666224946165314
##    ZM8997 Zambia 0.665562365413285
##    ZM8976 Zambia 0.665396720225278
##    ZM9084 Zambia 0.662249461653139
##    ZM9083 Zambia 0.661089945337088
##    ZM9194 Zambia 0.661089945337088
##    ZM9097 Zambia 0.660593009773066
##    ZM9101 Zambia 0.658770912704986
##    ZM9016 Zambia 0.658108331952957
##    ZM9156 Zambia 0.657942686764949
##    ZM9128 Zambia 0.653967202252774
##    ZM9023 Zambia 0.652973331124731
##    ZM9056 Zambia 0.652807685936724
##    ZM9066 Zambia 0.650819943680636
##    ZM9107 Zambia 0.648169620672519
##    ZM9064 Zambia  0.64568494285241
##    ZM9189 Zambia 0.644691071724366
##    ZM9033 Zambia 0.642206393904257
##    ZM9144 Zambia 0.641709458340235
##    ZM9145 Zambia 0.641378167964221
##    ZM9086 Zambia 0.641212522776213
##    ZM9221 Zambia 0.638562199768097
##    ZM9021 Zambia 0.637071393076031
##  ZM8963_1 Zambia 0.636740102700017
##    ZM9151 Zambia 0.633924134503893
##    ZM9041 Zambia 0.631770747059798
##    ZM9099 Zambia 0.630942521119761
##  ZM8966_1 Zambia  0.63044558555574
##    ZM9039 Zambia 0.626967036607587
##    ZM9069 Zambia 0.624979294351499
##    ZM9043 Zambia 0.624151068411463
##    ZM9089 Zambia 0.621500745403346
##    ZM9057 Zambia 0.619181712771244
##    ZM9178 Zambia 0.617359615703164
##    ZM9017 Zambia 0.616200099387113
##    ZM9063 Zambia 0.613218486002982
##    ZM9042 Zambia 0.610071227430843
##    ZM9153 Zambia 0.609574291866821
##    ZM9149 Zambia 0.607917839986748
##    ZM9114 Zambia 0.607420904422727
##    ZM9059 Zambia 0.605598807354646
##    ZM9081 Zambia 0.604604936226603
##    ZM9109 Zambia 0.603611065098559
##    ZM9014 Zambia 0.603279774722544
##    ZM9098 Zambia 0.602782839158522
##    ZM9188 Zambia 0.602451548782508
##    ZM9212 Zambia 0.601292032466457
## Completed: gl.report.callrate

The filters for SNPs markers are very similar to the ones used for silicoDArT markers, except that with SNPs markers, the MAF can be calculated.

#Filtering Monomorphs for snps

gl_snp2 <- gl.filter.monomorphs(gl_snp1, v=0)

nLoc(gl_snp2)
## [1] 5240
#Filtering Reproducibility for snps

gl_snp3 <- gl.filter.reproducibility(gl_snp2, t=0.95)
## Starting gl.filter.reproducibility 
##   Processing genlight object with SNP data
##   Removing loci with repeatability less than 0.95
## Completed: gl.filter.reproducibility
nLoc(gl_snp3)
## [1] 3552
#Filtering Callrate per individual for snps

gl_snp4 <- gl.filter.callrate(gl_snp3, 
                              method = "loc", 
                              threshold = 0.60, 
                              recalc = T)
## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing loci based on Call Rate, threshold = 0.6 
## Starting gl.recalc.metrics 
##   Processing genlight object with SNP data
## Starting utils.recalc.avgpic 
##   Processing genlight object with SNP data
##   Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp, AvgPIC
## Completed: utils.recalc.avgpic 
## Starting utils.recalc.callrate 
##   Processing genlight object with SNP data
##   Recalculating locus metric CallRate
## Completed: utils.recalc.callrate 
## Starting utils.recalc.maf 
##   Processing genlight object with SNP data
##   Recalculating FreqHoms and FreqHets
## Starting utils.recalc.freqhets 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHets
## Completed: utils.recalc.freqhets 
## Starting utils.recalc.freqhomref 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomRef
## Completed: utils.recalc.freqhomref 
## Starting utils.recalc.freqhomsnp 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomSnp
## Completed: utils.recalc.freqhomsnp 
##   Recalculating Minor Allele Frequency (MAF)
## Completed: utils.recalc.maf 
##   Locus metrics recalculated
## Completed: gl.recalc.metrics
## Completed: gl.filter.callrate
nLoc(gl_snp4)
## [1] 2228
nInd(gl_snp4)
## [1] 155
#Filtering Callrate per marker for snps

gl_snp5 <- gl.filter.callrate(gl_snp4,
                              method = "loc",
                              threshold = 0.70,
                              recalc = T)
## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing loci based on Call Rate, threshold = 0.7 
## Starting gl.recalc.metrics 
##   Processing genlight object with SNP data
## Starting utils.recalc.avgpic 
##   Processing genlight object with SNP data
##   Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp, AvgPIC
## Completed: utils.recalc.avgpic 
## Starting utils.recalc.callrate 
##   Processing genlight object with SNP data
##   Recalculating locus metric CallRate
## Completed: utils.recalc.callrate 
## Starting utils.recalc.maf 
##   Processing genlight object with SNP data
##   Recalculating FreqHoms and FreqHets
## Starting utils.recalc.freqhets 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHets
## Completed: utils.recalc.freqhets 
## Starting utils.recalc.freqhomref 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomRef
## Completed: utils.recalc.freqhomref 
## Starting utils.recalc.freqhomsnp 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomSnp
## Completed: utils.recalc.freqhomsnp 
##   Recalculating Minor Allele Frequency (MAF)
## Completed: utils.recalc.maf 
##   Locus metrics recalculated
## Completed: gl.recalc.metrics
## Completed: gl.filter.callrate
nLoc(gl_snp5)
## [1] 1700

7.4 Filtering by Minor Allele Frequency

GLOSSARY:

MAF

MAF or Minor Allele Frequency: Refers to the frequency at which the second most common allele occurs in a given population, provides information to differentiate between common and rare variants, so it is widely used in population genetics studies.

gl_snp6 <- gl.filter.maf(gl_snp5, threshold = 0.05)
## Starting gl.filter.maf 
##   Processing genlight object with SNP data
##   Removing loci with MAF < 0.05 over all the dataset 
##                 and recalculating FreqHoms and FreqHets
## Completed: gl.filter.maf
nLoc(gl_snp6)
## [1] 803

7.5 Save filtered data

#Save gl object for SNP data with filtered data

save(gl_snp6, file="gl_snp6_803m_155ind.rdata")
#load the data when needed

load("gl_snp6_803m_155ind.rdata")
#Load the libraries "poppr" and "ape"

library(poppr)

library(ape)

7.6 From genlight to a genind object

GLOSSARY:

A genind object

The package adegenet (dartR is based on this package) uses two types of object to storage genetic information: genind is one of them and it is used for individual genotypes.

Unfortunately the function aboot only allows genind objects to use “Nei distance”, which is what we need to work with SNP markers in general.

#Converting genlight into genind.

gi_snp6 <- gl2gi(gl_snp6)
## Starting gl2gi 
##   Processing genlight object with SNP data
## Matrix converted.. Prepare genind object...
## Completed: gl2gi

7.7 Distance and NJ tree for SNPs

phy_803m_155ind <- aboot(gi_snp6,
                         tree="nj",
                         dist="nei.dist",
                         sample = 100, #number of iteration for bootstrap
                         missing = "mean",
                         showtree=FALSE,
                         root=FALSE)
## 
Running bootstraps:       100 / 100
## Calculating bootstrap values... done.
library(ggtree)
library(ape)
info <- read.csv("files/ind.metafile_6037markers_155ind.csv")
ggtree(phy_803m_155ind, layout="rectangular") %<+% info %>% +
  geom_point(size=1, aes(color=province))+
  geom_tree(aes(color=province))+
  geom_tiplab(aes(label=label), color="black",size=1.5)+
  scale_color_manual(values = c("#219C90",
                                "#B931FC", 
                                "#EE9322", 
                                "#D83F31"),
                    labels = c('Eastern province', 
                               'Southern province', 
                               'North-western province', 
                               "Copperbelt province"))

ggsave("plots/Dendrogram_Zambia_803m_SNPs_155ind_colored.png", width = 8, height = 20)

 

 

8 Principal Coordinates Analysis - PCoA

PCoA is a method to analyze and visualize similarities or dissimilarities of data.By using PCoA we can visualize individual and population differences.

NOTE:

There is also a method called ‘Principal Component Analysis’ or “PCA” which is different from PCoA.

8.1 For SilicoDArT markers

####----Redefine the population information---####

gl_silico7 <- gl6_imputed

pop(gl_silico7) <- gl_silico7@other$ind.metrics$province

####----Pcoa---####

pcoa_silico <- gl.pcoa(gl_silico7, nfactors=10)
## Starting gl.pcoa 
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Performing a PCA, individuals as entities, loci as attributes, Tag P/A as state
## Completed: gl.pcoa

gl.pcoa.plot(pcoa_silico,
             gl_silico7,
             xaxis=1,
             yaxis=2)
## Starting gl.pcoa.plot 
##   Processing an ordination file (glPca)
##   Processing genlight object with Presence/Absence (SilicoDArT) data
##   Plotting populations in a space defined by the presence/absence data
##   Preparing plot .... please wait
## Completed: gl.pcoa.plot

8.2 For SNP markers

####----Redefine the population information---####

gl_snp7 <- gl_snp6

pop(gl_snp7) <- gl_snp7@other$ind.metrics$province

####----Pcoa---####

pcoa_snps <- gl.pcoa(gl_snp7, nfactors=10)
## Starting gl.pcoa 
##   Processing genlight object with SNP data
##   Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
## Completed: gl.pcoa

gl.pcoa.plot(pcoa_snps,
             gl_snp7,
             xaxis=1,
             yaxis=2)
## Starting gl.pcoa.plot 
##   Processing an ordination file (glPca)
##   Processing genlight object with SNP data
##   Plotting populations in a space defined by the SNPs
##   Preparing plot .... please wait
## Completed: gl.pcoa.plot

 

9 Other options

There are other options different from R. Here some options to explore.

The product From the Genomic Open-source Breeding informatics initiative (GOBii)

http://cbsugobii05.biohpc.cornell.edu/wordpress/index.php/products-v2/

9.1 For filtering DArTseq datasets

Among all of GOBii’s products, A tool stands up:

  • DartView from KDXplore

  • To download

http://cbsugobii05.biohpc.cornell.edu/wordpress/index.php/dartview/

  • Documentation:

https://software.kddart.com/kdxplore/dartview/dartviewdocs/KDXplore-DartView.html

9.2 For distance and plotting

Also specially made for DArTseq markers

  • KDCompute

  • Documentation

https://gobiiproject.atlassian.net/wiki/spaces/GD/pages/41222145/Data+QC

https://www.kddart.org/help/kdcompute/KDComputeHelp.html